Movies Recommendation in Flixster

Data Mining II

Beatriz Pinto 201503815, Filipe Justiça 201606339, Tiago Coelho 201604170

Faculdade de Ciências da Universidade do Porto

Introduction

Context and Motivation


   Recommendation systems have become very popular and are an essential ingredient in the success of most social media platforms, online stores and other media-service distributors.
  The recommendation task is one of the Web Mining applications that has been studied and developed to find patterns and learn from the large amounts of user data generated by online usage. Companies like YouTube, Netflix or Amazon have been using these approaches to understand their clients' and users' needs and to provide them with recommendations that improve their engagement with the platform. Another common application of Web Usage Mining is the discovery of user communities and profiles for more efficient advertising. Targeted and recommended ads are used more and more each day, and since publicity is transversal and applicable to almost every type of commercial business, there is a lot of work to be done.
  Dealing with such large and sparse inputs of text, metadata, images and other log information from web usage can be computationally challenging, but once we are able to manage it the applications have great potential to enhance both the user experience and the companies' reach and profit. It is therefore essential for anyone working in Web Mining to understand the existing recommendation-system approaches and know how to apply them.
  With this work we aim to do exactly that: apply the different types of recommendation algorithms we have learned to real-world data and compare their performance.

Problem Definition and Methodology


  The Recommendation Problem is typically characterized by two objects: a set of users and a set of items to be recommended to those users. The goal is to learn a function that estimates the relevance and usefulness of a particular item for a particular user. Depending on the context and the data, there are two main prediction tasks:
  • Item prediction: produces an ordered list of items which the user is likely to consume or like in the future;
  • Rating prediction: produces the rating that the user is likely to give to a certain item they have not seen yet.

  In the proposed assignment we deal with data collected from the website Flixster, a social movie platform where users can look for movie recommendations and also share their opinions and rate the movies they have seen. This gives us the standard recommendation setting, where our task is to predict the best movies to recommend to a user, or the rating a user might give to a movie.
  To achieve this goal we start with an Exploratory Data Analysis, where we first get to know the datasets and the information available and then delve into the main features and behaviors captured by the data. Here we also do some Data Visualization, which helped us understand which attributes are important to include in our models, which time periods are statistically significant, among other things.
  The following step was to build our recommendation algorithms. In the section Recommendation Models we begin by briefly explaining and illustrating the reasoning behind each model's approach, followed by the experimental steps and implementation. The algorithms we decided to work with are based on the following approaches:
  • Popularity
  • Association Rules
  • Collaborative filtering
  • Hybrid Approaches

And lastly we use several evaluation metrics to test and compare the performance of the models' implementations, which we explain and whose results we present in an intuitive and visual way. To do this we train our models to learn the best recommendations in a given time period and then test those recommendations against what the users actually watched in the following time period. For each model we calculate the Precision, Recall, F1-score and the percentage of users that watched at least one of our recommendations. These measures are given by the following:

  $Precision=\frac{\text{number of movie recommendations the user has watched}}{\text{number of movies that were recommended}}$

  $Recall=\frac{\text{number of movie recommendations the user has watched}}{\text{number of movies the user has watched in the same period}}$

  $F1\ score=\frac{2 \cdot Precision \cdot Recall}{Precision+Recall}$

  It is important to note that the number of movies the user has watched in the same period is not the total number of movies the user watched, but the number watched within the subset of possible recommendations, which we took to be the 50 most popular movies of that time period.
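On toy data, these measures can be computed as in the sketch below (the movie-id lists in the example are hypothetical):

```python
def evaluate_user(recommended, watched):
    """Precision, recall and F1-score of one user's recommendation list.

    `watched` holds only the movies the user saw within the pool of
    possible recommendations (the 50 most popular of the period),
    matching the definition of recall given above."""
    hits = len(set(recommended) & set(watched))
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(watched) if watched else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1

# hypothetical user: 5 recommendations, 2 of them among the 3 movies watched
p, r, f1 = evaluate_user([48019, 39384, 47408, 42237, 16752], [39384, 47408, 20409])
```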

   For each type of approach we test the models with different numbers of recommendations given to the user, and then try to choose the best number based on the performance measures above. It is important to understand that, given the way the recall is calculated, the more recommendations we give, the higher the recall will be. There should therefore be a compromise between recall and the number of recommendations given, because it would not make sense to suggest too many movies to the user. After selecting the number of recommendations for each model, we consecutively train and test all the models through a sliding window to get the final results.
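The sliding-window procedure can be sketched as follows; the 1-month step and the train/test window boundaries follow the description above, while the helper names are our own:

```python
from datetime import datetime

def add_month(d):
    """Advance a datetime by one calendar month (day is kept as-is)."""
    year_carry, month0 = divmod(d.month, 12)
    return d.replace(year=d.year + year_carry, month=month0 + 1)

def month_windows(start, n_windows):
    """Consecutive (train_start, train_end, test_end) tuples: a model is
    trained on ratings in [train_start, train_end) and tested on ratings
    in [train_end, test_end); the window then slides one month forward."""
    windows = []
    cur = start
    for _ in range(n_windows):
        windows.append((cur, add_month(cur), add_month(add_month(cur))))
        cur = add_month(cur)
    return windows

windows = month_windows(datetime(2007, 1, 1), 3)
```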

Exploratory Data Analysis

Handling the Datasets: Pre-processing


  We start with the given zip files, which comprise 4 datasets: one with the list of ratings (4 columns: the id of the user -userid- and of the movie -movieid-, the score -rating- and the date and time -date-), a metadata file with the user profile information (7 attributes: userid, gender, location, memberfor, lastlogin, profileview and age), and the last 2 files, which are actually the same and consist of the movie metadata (associating each movieid with the name of the movie).
  Since these datasets were text files, we decided to convert them to the csv format for an easier pre-processing phase. This format is more convenient since we can easily import the data into our environment using the pandas package, which handles separators and data manipulation very well, and quickly create our desired dataframes. The csv files contain exactly the same information and attributes as the text files; we have just converted them and simplified the names.
  So the first thing to do is import the necessary packages (numpy, pandas, plotly and datetime) and our ratings, movies and users files. While doing this we applied the pandas built-in methods to handle the different delimiters, separators and encodings of the files, and also to convert the date and memberfor entries to the proper python datetime format for easier access in future manipulations. We are now ready to inspect our dataframes and do any necessary pre-processing tasks.
  We start by visualizing our 3 dataframes and their main characteristics by calling the routines head() and info(). We first look at the ratings table and check that it has 4 columns and 8196078 rows. The data types are fine, but for consistency we decided to change the data types of the userid and movieid values to int (previously object and float, respectively). Before doing this we first checked for missing or repeated values: in this dataframe only the last row had missing data, which we therefore removed, and there were no repeated entries.
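The text-to-csv conversion mentioned above can be sketched as below; the source file name, delimiter and encoding in the example are assumptions, since these varied between the original files:

```python
import csv

def txt_to_csv(src_path, dst_path, delimiter=";", encoding="utf-8"):
    """Re-write a delimited text file as a ';'-separated csv file."""
    with open(src_path, encoding=encoding) as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst, delimiter=";")
        for line in src:
            # split each raw line on the source delimiter and re-emit it
            writer.writerow(line.rstrip("\r\n").split(delimiter))

# hypothetical call: txt_to_csv("users.txt", "users.csv")
```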
In [1]:
%%capture


from __future__ import division
from datetime import datetime, timedelta
import pandas as pd
import numpy as np


import chart_studio.plotly as py
import plotly.offline as pyoff
import plotly.graph_objs as go
from plotly.subplots import make_subplots

users=pd.read_csv("users.csv",delimiter=";")
movies=pd.read_csv("movies.csv",delimiter=";",encoding = "ISO-8859-1")
ratings=pd.read_csv("ratings.csv",sep='\t', lineterminator='\r',encoding = "UTF-16 LE")
ratings['date']= pd.to_datetime(ratings['date']) 
users['memberfor']= pd.to_datetime(users['memberfor'],format='%d/%m/%Y %H:%M')
In [2]:
ratings.head()
Out[2]:
userid movieid rating date
0 882359 81.0 1.5 2007-10-10
1 882359 926.0 1.0 2007-10-10
2 882359 1349.0 2.0 2007-10-10
3 882359 2270.0 1.0 2007-10-10
4 882359 3065.0 5.0 2007-12-29
In [3]:
#print(ratings.info())
#print(ratings.isnull().sum())
#print(ratings.iloc[-1])
ratings.drop([8196077],inplace=True)
#print(ratings.isnull().sum())
ratings.isnull().sum()
ratings['movieid']=pd.to_numeric(ratings['movieid'],downcast='integer')
ratings['userid']=pd.to_numeric(ratings['userid'],downcast='integer')
print(ratings.info())
print('duplicates: ',ratings.duplicated().sum())
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8196077 entries, 0 to 8196076
Data columns (total 4 columns):
 #   Column   Dtype         
---  ------   -----         
 0   userid   int32         
 1   movieid  int32         
 2   rating   float64       
 3   date     datetime64[ns]
dtypes: datetime64[ns](1), float64(1), int32(2)
memory usage: 250.1 MB
None
duplicates:  0
  Next we look at the users table and again check the data types and missing and repeated values. We left the data types as they were, and there were no repeated values. Nevertheless there is a significant number of missing values in the age column (more than 25% of the entries), and also in the gender and lastlogin information (less than 7%). This makes it more demanding to find similarities and communities between users based on anything other than their ratings, and complicates the creation of user profiles.
In [4]:
#print(users.head())
print(users.info())
#print(users.isnull().sum())
print('duplicates: ',users.duplicated().sum())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1002796 entries, 0 to 1002795
Data columns (total 7 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   userid       1002796 non-null  int64         
 1   gender       935267 non-null   object        
 2   location     1002593 non-null  float64       
 3   memberfor    1002593 non-null  datetime64[ns]
 4   lastlogin    944871 non-null   float64       
 5   profileview  747178 non-null   float64       
 6   age          747178 non-null   float64       
dtypes: datetime64[ns](1), float64(4), int64(1), object(1)
memory usage: 53.6+ MB
None
duplicates:  0
  Lastly we have the movies dataframe, for which the same process was repeated. With only 2 columns, this table did not have any duplicates or missing values, so no modifications were needed. After making sure the columns had the same names in the 3 dataframes, we are ready to explore and carry out a deeper analysis of the data.
In [5]:
#print(movies.head())
print(movies.info())
#print(movies.isnull().sum())
print('duplicates: ',movies.duplicated().sum())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66730 entries, 0 to 66729
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   moviename  66730 non-null  object
 1   movieid    66730 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 1.0+ MB
None
duplicates:  0

Data Visualization


  To better understand our set of users we start by characterizing the population with a histogram of the ages. The result is in the following image and tells us that most users are between 18 and 30 years old. Another thing to take into consideration is that, even though there are 1976 users with ages between 107 and 109, there are almost no users between 72 and 107 years old. We think these values might be due to a systematic error and are probably outliers, but we still decided to consider their ratings.
In [6]:
plot_data = [
    go.Histogram(
        x=users['age'],
    )
]

plot_layout = go.Layout(
        title='Age Distribution', 
        xaxis_title="Age",
        yaxis_title="Number of Users"
        
    )

fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
  Next we studied the daily number of newly registered users and plotted the respective histogram using the memberfor date values. As we can see, there are only new registrations in the first 3 days of each month, which is unusual but may be due to the subscription scheme or the way the system is programmed to update the user data. With this in mind we thought it would be relevant to check whether there are ratings made by a user prior to their registered membership date. When this happens, we change the memberfor field to the date of the user's earliest rating. Another thing to notice is that there are entries belonging to the year 1900, which is of course impossible, and therefore those users will not be considered.
In [7]:
tx_data=users
tx_data['memberforYearMonthDay'] = tx_data['memberfor']


newUsersmonth = tx_data.groupby(['memberforYearMonthDay'])['userid'].nunique().reset_index()
plot_data = [
    go.Bar(
        x=newUsersmonth['memberforYearMonthDay'],
        y=newUsersmonth['userid'],
    )
]

plot_layout = go.Layout(
        xaxis={"type": "category"},
        yaxis_title="Number of Users",
        title='New users'
    )

fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
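The memberfor correction described above (replacing a registration date that postdates a user's first rating) can be sketched on toy data as follows; the toy values are made up for illustration:

```python
import pandas as pd

# toy data: user 2 was registered *after* their first rating was made
users = pd.DataFrame({"userid": [1, 2],
                      "memberfor": pd.to_datetime(["2006-01-01", "2007-06-01"])})
ratings = pd.DataFrame({"userid": [1, 2, 2],
                        "date": pd.to_datetime(["2006-03-01", "2007-02-01",
                                                "2007-07-01"])})

# earliest rating per user, aligned to the rows of the users table
first_rating = users["userid"].map(ratings.groupby("userid")["date"].min())

# keep memberfor where it is consistent (or the user has no ratings),
# otherwise fall back to the date of the earliest rating
cond = (users["memberfor"] <= first_rating) | first_rating.isna()
users["memberfor"] = users["memberfor"].where(cond, first_rating)
```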
  We then looked at the volume of ratings made every month and verified that the first semester of 2007 was the period when the most ratings were added to the site. This information will be important later, when making predictions and testing their performance for the several recommendation systems.
In [8]:
tx_data=ratings
tx_data['date'] = tx_data['date'].map(lambda date: 100*date.year + date.month )


newcomentsmonth = tx_data.groupby(['date'])['userid'].count().reset_index()
plot_data = [
    go.Bar(
        x=newcomentsmonth['date'],
        y=newcomentsmonth['userid'],
    )
]

plot_layout = go.Layout(
        xaxis={"type": "category"},
        yaxis_title="Number of ratings",
        title='New monthly ratings'
    )

fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
In [9]:
tx_data=ratings
tx_data=pd.merge(tx_data,movies, on="movieid", how="left")
tx_data.head()
views = tx_data.groupby(['movieid'])['userid'].count().reset_index()
views=views.sort_values(by='userid', ascending=False)
views=views.rename(columns={"userid": "Views"})
views=pd.merge(views,movies, on="movieid", how="right")
views.head();


  We now look for the top 10 most popular movies of all time registered in the data. To do this, a new dataframe is created by grouping the entries in ratings by the value of movieid. First we show the top 10 movies with the most ratings, meaning the highest number of views, in the following bar chart. We then compare it with a top 10 calculated as a weighted mean of the number of views and the mean rating score given by the users.
  The two lists are not the same, but the top 4 most viewed movies were also the best rated in this group, so there is probably a correlation between the number of views and the score of a movie.
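The suggested relationship between views and score can be quantified with a correlation coefficient; a minimal sketch on made-up per-movie numbers (the real computation would use the table of views and mean ratings built from the data):

```python
import pandas as pd

# made-up stand-in for the per-movie views / mean-rating table
movierank = pd.DataFrame({"Views": [500, 300, 250, 120, 80],
                          "rating": [4.2, 4.0, 3.9, 3.1, 3.4]})

# Pearson correlation between popularity (views) and mean score
corr = movierank["Views"].corr(movierank["rating"])
```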

In [10]:
plot_data = [
    go.Bar(
        x=views['movieid'][0:10],
        y=views['Views'][0:10],
        text=views["moviename"][0:10] ,textposition='inside', marker_color='lightsalmon'
    )
]

plot_layout = go.Layout(
        xaxis={"type": "category"},
        title='Top 10 most viewed movies',
        xaxis_title="movieid",
        yaxis_title="Views"
        
    )

fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
movierank = tx_data.groupby(['movieid'])['rating'].mean().reset_index()
movierank = pd.merge(movierank,views, on="movieid", how="right")
movierank["FinalRating"]=(movierank["rating"]/5*0.7+0.3*movierank["Views"]/np.max(movierank["Views"]))*5
movierank = movierank.sort_values(by='FinalRating', ascending=False)
movierank.head();
plot_data = [
    go.Bar(
        x=movierank['movieid'][0:10],
        y=movierank['FinalRating'][0:10],
        text=movierank["moviename"][0:10] ,textposition='inside', marker_color='lightsalmon'
    )
]

plot_layout = go.Layout(
        xaxis={"type": "category"},
        title='Top 10 best rated and most viewed movies',
        xaxis_title="movieid",
        yaxis_title="Rating"
        
    )

fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
  In the next images we plot the number of views against the rating score of the movies to illustrate this relationship. The scatter plot shows a rather noisy correlation, where the rating score grows with the number of views. The second image shows the same representation, but with the weighted rating score (weighted average of the number of views and the mean score) instead, and filtering out movies with 100 views or fewer.
In [11]:
plot_data = [
    go.Scatter(
        x=movierank.query("Views>100")['Views'],
        y=movierank.query("Views>100")['rating'],
        mode='markers'
    )
]

plot_layout = go.Layout(
        title='Rating score vs. number of views',
        xaxis_title="views",
        yaxis_title="rating score"
        
    )

fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
plot_data = [
    go.Scatter(
        x=movierank.query("Views>100")['Views'],
        y=movierank.query("Views>100")['FinalRating'],
        mode='markers',
        hovertext=movierank.query("Views>100")["moviename"]
    )
]

plot_layout = go.Layout(
        title="Final Rating (weighted average) vs. number of views",
        xaxis_title="views",
        yaxis_title="final rating"
        
    )

fig = go.Figure(data=plot_data, layout=plot_layout)
pyoff.iplot(fig)
  And finally we split the users into age groups and represent the top 10 most viewed movies of each group in a pie chart. The results show that, although some movies are among the most viewed for all age groups, the rest can be very different and have varying numbers of views. We therefore need to take this into account when recommending a movie to a user, especially in the popularity-based recommendation model.
In [12]:
def most_viewed_age(age1,age2):
  tx_data=ratings
  tx_data=pd.merge(tx_data,movies, on="movieid", how="left")
  tx_data=pd.merge(tx_data,users[["userid","age"]], on="userid", how="left")
  tx_age=tx_data.query(str(age1)+"<=age<="+str(age2))
  views = tx_age.groupby(['movieid'])['userid'].count().reset_index()
  views=views.sort_values(by='userid', ascending=False)
  views=views.rename(columns={"userid": "Views"})
  views=pd.merge(views,movies, on="movieid", how="right")
  return views

#most_viewed_age(20,20)

specs = [[{'type':'domain'}, {'type':'domain'}], [{'type':'domain'}, {'type':'domain'}]]
subplot_titles=['Age 10-20', 'Age 20-30','Age 30-40','Age 40-50']
fig = make_subplots(rows=2, cols=2, specs=specs, subplot_titles=subplot_titles)
fig.add_trace(go.Pie(labels=most_viewed_age(10,20)["moviename"][0:10], 
                     values=most_viewed_age(10,20)["Views"][0:10]), 1, 1)
fig.add_trace(go.Pie(labels=most_viewed_age(20,30)["moviename"][0:10], 
                     values=most_viewed_age(20,30)["Views"][0:10]), 2, 1)
fig.add_trace(go.Pie(labels=most_viewed_age(30,40)["moviename"][0:10], 
                     values=most_viewed_age(30,40)["Views"][0:10]), 1, 2)
fig.add_trace(go.Pie(labels=most_viewed_age(40,50)["moviename"][0:10], 
                     values=most_viewed_age(40,50)["Views"][0:10]), 2, 2)

# Tune layout and hover info
fig.update_traces(hoverinfo='label+percent+name', textinfo='label', textposition='inside')
fig.update(layout_title_text='Most viewed by age group',
           layout_showlegend=False)

fig = go.Figure(fig)
fig.show()

Recommendation Systems

Popularity


  Recommendations based on item popularity are the simplest type of recommendation system to build. The idea is very similar to what we did in the Exploratory Analysis when we showed the top 10 movies with the most views or the best final rating score, with the difference that now we want personalized recommendations: there is no interest in suggesting a movie a user has already seen, so we cannot just recommend the same list of popular movies to everyone.
  Given this, our approach is the following:
  • Considering a fixed time window (we used 1-month periods), calculate the ordered list of the most watched movies;
  • For each user that was active in that time period (made at least one rating), determine whether they have previously watched each movie in that list;
  • Select the first 10 most viewed movies from the list which were not seen by that particular user, to give them a personalized recommendation;
In [72]:
users=pd.read_csv("users.csv",delimiter=";")
movies=pd.read_csv("movies.csv",delimiter=";",encoding = "ISO-8859-1")
ratings=pd.read_csv("ratings.csv",sep='\t', lineterminator='\r',encoding = "UTF-16 LE")
ratings['date']= pd.to_datetime(ratings['date']) 
users['memberfor']= pd.to_datetime(users['memberfor'],format='%d/%m/%Y %H:%M')
ratings.drop([8196077],inplace=True)
#print(ratings.isnull().sum())
ratings.isnull().sum()
ratings['movieid']=pd.to_numeric(ratings['movieid'],downcast='integer')
ratings['userid']=pd.to_numeric(ratings['userid'],downcast='integer')

###function that filters the movieids and names to have an ordered list of the most rated/seen movies in a time
#window starting in date 1 and ending in date 2
def pop_by_data(date1,date2):
    d1=pd.to_datetime(date1, format='%d/%m/%Y')
    d2=pd.to_datetime(date2, format='%d/%m/%Y')
    mask = (ratings['date'] >= d1) & (ratings['date'] <= d2)
    tx_data=ratings.loc[mask]
    tx_data=pd.merge(tx_data,movies, on="movieid", how="left")
    views = tx_data.groupby(['movieid'])['userid'].count().reset_index()
    views=views.sort_values(by='userid', ascending=False)
    views=views.rename(columns={"userid": "Views"})
    views=pd.merge(views,movies, on="movieid", how="left")
    return views
%timeit pop_by_data('01/05/2007','01/06/2007')
pop_by_data('01/05/2007','01/06/2007')



###this function returns a dataframe that tells us whether each of the top movies was watched by each user
'''
def pivot(date1,date2):
  d2=pd.to_datetime(date2, format='%d/%m/%Y')
  tx_user = ratings.loc[ratings['date'] <= d2]
  tx_user=tx_user[tx_user.movieid.isin(pop_by_data(date1,date2)["movieid"][0:10])]
  index=pd.pivot_table(tx_user, index='userid', columns='movieid', values='rating')
  return index
#%timeit pivot('01/05/2007','01/06/2007')
pivot('01/05/2007','01/06/2007')
'''

def revert_pivot(date1,date2):
  d2=pd.to_datetime(date2, format='%d/%m/%Y')
  tx_user = ratings.loc[ratings['date'] <= d2]
  tx_user=tx_user[tx_user.movieid.isin(pop_by_data(date1,date2)["movieid"][0:50])]
  index=pd.pivot_table(tx_user, index='movieid', columns='userid', values='rating')
  new_order=pop_by_data(date1,date2)["movieid"][0:50].tolist()
  index=index.reindex(new_order)
  return index
rp=revert_pivot('01/05/2007','01/06/2007')
rp.columns


def r_lista(userid,rp,n):
  return rp[np.isnan(rp[userid])].index.tolist()[0:n]

def table(date1,date2,N):
  rp=revert_pivot(date1,date2)
  users=rp.columns.to_numpy()
  vfunc = np.vectorize(r_lista, excluded=["rp","n"], otypes=[list])
  return pd.DataFrame(data={"userid":users, "recommended_p_movieids":vfunc(users,rp=rp,n=N)})
173 ms ± 16.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


   After implementing these steps we need to check how relevant and accurate our recommendations would have been for the users. To obtain this information we look at the ratings made in the following period (in this case 1 month again) and compare, head to head, the list of movies watched by the user with the list of movies we suggested for that user.
  The following table shows an example of the results of the model when we give 5 recommendations to the user. For each user we have the list of 5 movies we recommended (based on the ones they had already seen in the past and the most popular movies of the previous month), the list of movies they in fact watched in the following month, and the number of successful recommendations (0-5: the number of movies from the recommendation list that they have seen).

In [27]:
def revert_pivot_after(date1,date2,date3):
  d2=pd.to_datetime(date2, format='%d/%m/%Y')
  d3=pd.to_datetime(date3, format='%d/%m/%Y')
  tx_user = ratings.loc[(ratings['date'] <= d3) & (ratings['date'] >= d2)]
  tx_user=tx_user[tx_user.movieid.isin(pop_by_data(date1,date2)["movieid"][0:50])]
  index=pd.pivot_table(tx_user, index='movieid', columns='userid', values='rating')
  new_order=pop_by_data(date1,date2)["movieid"][0:50].tolist()
  index=index.reindex(new_order)
  return index


def r_lista_after(userid,rp):
  return rp[np.isnan(rp[userid])!=True].index.tolist()
def table_after(date1,date2,date3):
  rp=revert_pivot_after(date1,date2,date3)
  users=rp.columns.to_numpy()
  vfunc = np.vectorize(r_lista_after, excluded=["rp"], otypes=[list])
  return pd.DataFrame(data={"userid":users, "seen_foll_month":vfunc(users,rp=rp)})
verif=table_after('01/05/2007','01/06/2007','01/07/2007')


def is_true(userid,r):
  u=pd.DataFrame(data={"movies": r.loc[userid,"seen_foll_month"]+r.loc[userid,"recommended_p_movieids"]})
  return len(u[u.duplicated()])
def label(r):
  users=r.dropna(subset=['seen_foll_month'])["userid"].to_numpy()

  vfunc = np.vectorize(is_true, excluded=["r"], otypes=[int])
  r_index=r.set_index('userid')
  num=vfunc(users,r=r_index)
  label=pd.DataFrame(data={"userid":users, "n_success_recommendations":num})
  r=pd.merge(r,label,on='userid',how='left')
  r["n_success_recommendations"].fillna(0,inplace=True)
  return r



def K_recomendacoes_populares(date1,date2,date3,N):
  recomendas=table(date1,date2,N)
  verif=table_after(date1,date2,date3)
  recomendas=pd.merge(recomendas,verif,on='userid',how='left')
  recomendas=label(recomendas)
  recomendas["seen_foll_month"].fillna(0,inplace=True)
  recomendas["recomlen"]=recomendas["recommended_p_movieids"].apply(len)
  recomendas["nºvistas"]=recomendas["seen_foll_month"].apply(lambda x: len(x) if x!=0 else 0)
  recomendas["recall"]=recomendas["n_success_recommendations"]/recomendas["nºvistas"]
  recomendas["recall"].fillna(0,inplace=True)
  return recomendas 

r=K_recomendacoes_populares('01/05/2007','01/06/2007','01/07/2007',5)
r.iloc[595:599]
Out[27]:
userid recommended_p_movieids seen_foll_month n_success_recommendations recomlen nºvistas recall
595 91483 [48019, 39384, 47408, 42237, 16752] 0 0.0 5 0 0.000000
596 91681 [39384, 47408, 16752, 45119, 49294] [39384, 47408, 20409] 2.0 5 3 0.666667
597 91743 [48019, 39384, 36449, 47408, 42237] 0 0.0 5 0 0.000000
598 91789 [48019, 39384, 47408, 16752, 45119] [47408, 50010, 18962] 1.0 5 3 0.333333


  Since we are using training data, our recommendations did not actually affect the choices of the users. As the results illustrate, many users did not watch any movie at all in the following month, and when giving them recommendations we are not distinguishing between a user that is going to stay active on the site and one that is not.
  For this reason, when calculating precision, recall and the other performance measures, we only consider users that watched at least one movie in the test time period (the following month). Nevertheless, in a real-world application it would be important to make recommendations to users that are likely to become inactive, since suggesting things they are interested in would probably remind them to keep watching movies.
  In the following table we gather the performance measures Precision, Recall, F1-score and the percentage of users that have seen at least one of the recommendations, for 3, 5, 10 or 20 recommendations. As previously mentioned, the recall is expected to grow with the number of recommendations. So although we get a higher recall with 20 recommendations, we will not choose this as the best model, not only because 20 suggestions are too many, but also because they give a very low precision. After weighing the advantages of each number of recommendations, we decided that 5 is the best choice.

In [36]:
precision=[]
percentage=[]
n=[]
recall=[]
f_score=[]
for i in [3,5,10,20]:
  n.append(i)
  r=K_recomendacoes_populares('01/05/2007','01/06/2007','01/07/2007',i)
  precision.append(str(np.mean(r.query("seen_foll_month != 0")["n_success_recommendations"]/i)*100)[0:4]+"%")
  recall.append(str(np.mean(r.query("seen_foll_month != 0")["recall"])*100)[0:4]+"%")
  p=np.mean(r.query("seen_foll_month != 0")["n_success_recommendations"]/i)*100
  R=np.mean(r.query("seen_foll_month != 0")["recall"])*100
  f_score.append(str(2*p*R/(R+p))[0:4]+"%")
  percentage.append(str(r.query("n_success_recommendations > 0")["n_success_recommendations"].count()/r.query("seen_foll_month != 0")["seen_foll_month"].count()*100)[0:4]+"%")
measures = pd.DataFrame(data={"Precision":precision,"Recall":recall,"F1":f_score,"% of users that watched 1 or more recommendations":percentage, "n of recommendations":n})
measures = measures.set_index('n of recommendations')
measures
Out[36]:
Precision Recall F1 % of users that watched 1 or more recommendations
n of recommendations
3 25.1% 33.3% 28.6% 57.5%
5 21.2% 42.3% 28.2% 68.2%
10 15.5% 52.6% 23.9% 77.7%
20 12.3% 71.1% 21.0% 87.6%

Popularity considering age groups



  We later decided to apply our knowledge of the different preferences between age groups to build another popularity-based recommender. We follow a very similar approach as before, but now we recommend the most popular movies within the age group the user belongs to.

  To do that we first compute the top movies for each age group only once, and then give the recommendations according to the group of each user. The results are presented in the following table, for the same performance measures and different numbers of recommendations.

  Contrary to what one might have expected, the results are slightly worse than those of the simplest popularity model. This may be due to the age distribution, since, as we saw, there are very few users younger than 15 or older than 40 years old. For these reasons we decided to stick with the first popularity algorithm.
In [74]:


def pop_by_data_age(date1,date2,age1,age2):
    # movie view counts between date1 and date2, restricted to users
    # whose age falls in [age1, age2)
    d1=pd.to_datetime(date1, format='%d/%m/%Y')
    d2=pd.to_datetime(date2, format='%d/%m/%Y')
    mask = (ratings['date'] >= d1) & (ratings['date'] <= d2)
    tx_data=ratings.loc[mask]
    tx_data=pd.merge(tx_data,movies, on="movieid", how="left")
    tx_data=pd.merge(tx_data,users, on="userid", how="left")
    mask2 = (tx_data["age"] >= age1) & (tx_data["age"] < age2)
    tx_data = tx_data.loc[mask2]
    views = tx_data.groupby(['movieid'])['userid'].count().reset_index()
    views=views.sort_values(by='userid', ascending=False)
    views=views.rename(columns={"userid": "Views"})
    views=pd.merge(views,movies, on="movieid", how="left")
    return views


def revert_pivot_age(date1,date2,age1,age2):
    d2=pd.to_datetime(date2, format='%d/%m/%Y')
    tx_user = ratings.loc[ratings['date'] <= d2]
    tx_user = pd.merge(tx_user,users, on="userid", how="left")
    mask2 = (tx_user["age"] >= age1) & (tx_user["age"] < age2)
    tx_user = tx_user.loc[mask2]
    tx_user=tx_user[tx_user.movieid.isin(pop_by_data_age(date1,date2,age1,age2)["movieid"][0:50])]
    index=pd.pivot_table(tx_user, index='movieid', columns='userid', values='rating')
    new_order=pop_by_data_age(date1,date2,age1,age2)["movieid"][0:50].tolist()
    index=index.reindex(new_order)
    return index

def table_age(date1,date2,N,age1,age2):
    rp=revert_pivot_age(date1,date2,age1,age2)
    users=rp.columns.to_numpy()
    vfunc = np.vectorize(r_lista, excluded=["rp","n"], otypes=[list])
    return pd.DataFrame(data={"userid":users, "recommended_p_movieids":vfunc(users,rp=rp,n=N)})

def K_recomendacoes_populares_age(date1,date2,date3,N):
    # same as K_recomendacoes_populares, but the top movies are computed
    # separately for each age group
    recomendas=table_age(date1,date2,N,5,15)
    for i in [[15,25],[25,35],[35,45],[45,55],[55,65]]:
        recomendas= recomendas.append(table_age(date1,date2,N,i[0],i[1]), ignore_index=True)
  
    recomendas.drop_duplicates(subset ="userid", keep = "first", inplace = True)
    verif=table_after(date1,date2,date3)
    recomendas=pd.merge(recomendas,verif,on='userid',how='left')
    recomendas=label(recomendas)
    recomendas["seen_foll_month"].fillna(0,inplace=True)
    recomendas["nºvistas"]=recomendas["seen_foll_month"].apply(lambda x: len(x) if x!=0 else 0)
    recomendas["recall"]=recomendas["n_success_recommendations"]/recomendas["nºvistas"]
    recomendas["recall"].fillna(0,inplace=True)
    return recomendas 

precision=[]
percentage=[]
n=[]
recall=[]
f_score=[]
for i in [3,5,10,20]:
    n.append(i)
    r=K_recomendacoes_populares_age('01/05/2007','01/06/2007','01/07/2007',i)
    precision.append(str(np.mean(r.query("seen_foll_month != 0")["n_success_recommendations"]/i)*100)[0:4]+"%")
    recall.append(str(np.mean(r.query("seen_foll_month != 0")["recall"])*100)[0:4]+"%")
    p=np.mean(r.query("seen_foll_month != 0")["n_success_recommendations"]/i)*100
    R=np.mean(r.query("seen_foll_month != 0")["recall"])*100
    f_score.append(str(2*p*R/(R+p))[0:4]+"%")
    percentage.append(str(r.query("n_success_recommendations > 0")["n_success_recommendations"].count()/r.query("seen_foll_month != 0")["n_success_recommendations"].count()*100)[0:4]+"%")

    

pop_by_data_age('01/05/2007','01/06/2007',30,50)
recomendas=table_age('01/05/2007','01/06/2007',10,20,30)   
revert_pivot_age('01/05/2007','01/06/2007',30,50)
measures = pd.DataFrame(data={"Precision":precision,"Recall":recall,"F1 score":f_score,"% of users that watched 1 or more recommendations":percentage, "n of recommendations":n})
measures = measures.set_index('n of recommendations')
print("Table of performance measures for popularity model considering age groups")
measures
Table of performance measures for popularity model considering age groups
Out[74]:
Precision Recall F1 score % of users that watched 1 or more recommendations
n of recommendations
3 24.3% 32.6% 27.8% 56.4%
5 20.6% 41.5% 27.5% 67.6%
10 14.5% 51.8% 22.7% 76.6%
20 10.3% 63.5% 17.8% 84.0%

Association Rules


  Association rules were originally applied in Market Basket Analysis with the purpose of finding unexpected patterns in the way consumers buy items.
  A transaction is the name given to the set of items bought together by a consumer. With the information of many transactions, performed by many users on several items, we aim to find the most frequent itemsets, meaning groups of items that are usually bought together. Based on these itemsets we are able to build association rules which, with a certain degree of confidence, tell us that a particular itemset $Y$ is likely to be bought if the client buys $X$. This is written $X \to Y$, where no items in $Y$ belong to $X$ and vice-versa ($X \cap Y = \emptyset$).

  In practice, an association rule could tell us, for instance, that if a customer goes to the store to buy wine and cheese, there is a high chance of them buying olives as well. Knowing this could help the store display olives close to the wine or cheese sections.
  The most common algorithm for generating the best rules in our dataset is Apriori, which implements 2 main steps:
  • Identification of frequent itemsets
  • Generation of rules

  The identification of frequent itemsets in the Apriori algorithm is done in levels, from the smaller itemsets to the larger, with a generate-and-test strategy: at each iteration the new frequent itemset candidates of size $k$ are generated from the previously found frequent itemsets of size $k-1$ and pruned using a minimum support. Support-based pruning eliminates itemsets with a support lower than the chosen minimum. The support of an item is the proportion of transactions in the dataset where the item appears, and the support of a rule is the support of the union of its itemsets: $sup(X \to Y) = sup(X \cup Y)$.

  Rule generation is done as follows: first we compute all non-empty subsets $s$ of each frequent itemset $I$, and for each subset we calculate the confidence of the rule $(I-s) \to s$, where the confidence is a measure of the strength of the rule, given by the percentage of transactions containing the antecedent that also contain the consequent: $conf(X \to Y) = \frac{sup(X \cup Y)}{sup(X)}$. Then we prune the rules by eliminating those with confidence lower than the minimum.
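To make the two measures concrete, here is a small hand computation on an invented set of transactions (the items are illustrative, not from our data):

```python
# Toy illustration of support and confidence for the rule {wine, cheese} -> {olives}
transactions = [
    {"wine", "cheese", "olives"},
    {"wine", "cheese"},
    {"wine", "olives"},
    {"cheese", "olives"},
    {"wine", "cheese", "olives"},
]

def support(itemset, transactions):
    # fraction of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

X = {"wine", "cheese"}
Y = {"olives"}
sup_rule = support(X | Y, transactions)          # sup(X -> Y) = sup(X u Y) = 2/5
conf_rule = sup_rule / support(X, transactions)  # conf = sup(X u Y) / sup(X) = 2/3
print(sup_rule, round(conf_rule, 3))
```

With a minimum confidence of 0.4 this rule would survive the pruning step; with 0.7 it would be discarded.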

Implementation


  In our case study we do not have items being bought by consumers but movies being watched by users. This is a limitation from the start, because the patterns in users' choices of movies should not be as strong as when people buy two items together in one trip to the supermarket. A user does not watch more than one movie at the same time, but we can try to guess which movies a user is likely to watch and enjoy based on what other users with similar viewing histories have watched. Given this, our task is to build a model consisting of a list of rules that best describe the patterns in the users' choices.
  In order to do this we start by organizing our data in the shape of transactions, where each transaction is the set of movies seen by a user until a certain point in time. In the following table we have an example of sets of movies seen by users between May and June of 2007.
In [9]:
users=pd.read_csv("users.csv",delimiter=";")
movies=pd.read_csv("movies.csv",delimiter=";",encoding = "ISO-8859-1")
ratings=pd.read_csv("ratings.csv",sep='\t', lineterminator='\r',encoding = "UTF-16 LE")
ratings['date']= pd.to_datetime(ratings['date']) 
users['memberfor']= pd.to_datetime(users['memberfor'],format='%d/%m/%Y %H:%M')
ratings.drop([8196077],inplace=True)

ratings.isnull().sum()
ratings['movieid']=pd.to_numeric(ratings['movieid'],downcast='integer')
ratings['userid']=pd.to_numeric(ratings['userid'],downcast='integer')


from mlxtend.preprocessing import TransactionEncoder
def revert_pivot_v(date1,date2):
  d1=pd.to_datetime(date1, format='%d/%m/%Y')
  d2=pd.to_datetime(date2, format='%d/%m/%Y')
  mask = (ratings['date'] >= d1) & (ratings['date'] <= d2)
  tx_user=ratings.loc[mask]
  # keep only the 50 most popular movies of the period (date1 to date2)
  filter_movies = pop_by_data(date1, date2)["movieid"][0:50]

  min_user_ratings = 0
  filter_users = tx_user['userid'].value_counts() > min_user_ratings
  filter_users = filter_users[filter_users].index.tolist()
  ratings_new = tx_user[(tx_user['movieid'].isin(filter_movies)) & (tx_user['userid'].isin(filter_users))]
  index=pd.pivot_table(ratings_new, index='movieid', columns='userid', values='rating')
  return index

def viram(userid,rp):
  return rp[np.isnan(rp[userid])!=True].index.tolist()

def table_viram(date1,date2):
  rp=revert_pivot_v(date1,date2)
  users=rp.columns.to_numpy()
  vfunc = np.vectorize(viram, excluded=["rp"], otypes=[list])
  return pd.DataFrame(data={"userid":users, "seen":vfunc(users,rp=rp)})

trans=table_viram('01/05/2007','01/06/2007')
dataset=trans["seen"].tolist()
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

trans.iloc[:3]
Out[9]:
userid seen
0 288 [13153, 16752, 36449, 47207, 47408, 56915]
1 729 [48019]
2 921 [34815, 48019]
In [10]:
print(len(trans))
6898



To generate our itemsets we keep only the ratings of the 50 most popular movies of the last month and encode each user's set of seen movies with the TransactionEncoder from the mlxtend package. Following this we can use the apriori method of the same package to find the most frequent itemsets and then generate the rules. In the next table we have a sample of some frequent itemsets, with their length and support, for a pruning with minimum support equal to 0.03. Generally one would start with a small support of around $minsup=\frac{10}{number \, of \, transactions}$, which in our case would give approximately 0.0015, but our machines took too much time to compute rules with such a small support, so we had to increase this number to have fewer itemsets to work with.

In [11]:
from mlxtend.frequent_patterns import apriori
frequent_itemsets = apriori(df, min_support=0.03, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets
frequent_itemsets=frequent_itemsets.sort_values(by="length",ascending=False)
frequent_itemsets.head()
Out[11]:
support itemsets length
1700 0.030734 (20644, 20645, 56915, 56916, 24251) 5
1687 0.030734 (56915, 24251, 56916, 45119) 4
1675 0.037257 (42237, 24251, 20644, 20645) 4
1677 0.032038 (24251, 20644, 20645, 49294) 4
1678 0.034213 (56915, 24251, 20644, 20645) 4
... ... ... ...
31 0.090026 (44439) 1
30 0.146419 (42237) 1
29 0.233836 (39384) 1
28 0.233836 (36449) 1
0 0.111482 (211) 1

1701 rows × 3 columns


Then we use the routine association_rules from the same package with a confidence threshold of 0.4. Here we can see some of the best rules generated. For instance, the first row tells us that a user who has seen the 3 movies in the antecedent of the rule is likely to watch the movie with id 56916 (the consequent), with a confidence of 91.5%. In total the algorithm found 3669 rules for around 7000 users.

In [12]:
from mlxtend.frequent_patterns import association_rules
rules=association_rules(frequent_itemsets, metric="confidence", min_threshold=0.4)
rules=rules.sort_values(by=["confidence","support","lift"],ascending=[False,False,False])

rules.head()
Out[12]:
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
113 (56915, 42237, 20645) (56916) 0.033923 0.112206 0.031023 0.914530 8.150423 0.027217 10.387185
0 (56915, 20644, 20645, 56916) (24251) 0.033778 0.110757 0.030734 0.909871 8.215042 0.026992 9.866366
131 (42237, 56915, 24251) (56916) 0.033923 0.112206 0.030734 0.905983 8.074251 0.026927 9.442895
46 (20644, 20645, 49294) (24251) 0.035518 0.110757 0.032038 0.902041 8.144342 0.028104 9.077692
35 (20644, 42237, 20645) (24251) 0.041316 0.110757 0.037257 0.901754 8.141756 0.032681 9.051226



  Now, to obtain the final model, we just have to choose rules for each user according to the movies he has watched in the past. In the following table we have the 24 rules that match the user with userid 84, together with their confidence levels. Then, according to the number of rules we wish to apply to each user, we recommend the movies that are the consequents of those rules (the rules with the highest confidence for that user).
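The set comparisons behind this matching can be sketched in plain Python with invented movie ids (the implementation below does the same test with pandas columns of frozensets): a rule applies to a user when its antecedent is a subset or superset of the movies he has seen and its consequent is entirely unseen.

```python
# Hypothetical rules as (antecedent, consequent, confidence) triples
rules = [
    (frozenset({1, 2}), frozenset({3}), 0.91),
    (frozenset({4}),    frozenset({5}), 0.70),  # antecedent unrelated to user -> skip
    (frozenset({1}),    frozenset({2}), 0.55),  # consequent already seen -> skip
]
seen = {1, 2}  # movies this toy user has watched

def matching_rules(rules, seen):
    # antecedent subset or superset of the seen set, consequent disjoint from it
    return [(a, c, conf) for a, c, conf in rules
            if (a <= seen or a >= seen) and c.isdisjoint(seen)]

recs = sorted(matching_rules(rules, seen), key=lambda r: -r[2])
print([c for _, c, _ in recs])  # consequents, best confidence first
```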

In [15]:
def viram2(userid,rp):
  return set(rp[np.isnan(rp[userid])!=True].index.tolist())

def viram_same(date1,date2):
  p=revert_pivot(date1,date2)
  users=p.columns.to_numpy()
  vfunc = np.vectorize(viram2, excluded=["rp"], otypes=[set])
  return pd.DataFrame(data={"userid":users, "viu":vfunc(users,rp=p)})

rec_a_rules = viram_same('01/05/2007','01/06/2007')
rec_a_rules=rec_a_rules.set_index('userid')
rules[((rules['antecedents'] >= rec_a_rules.loc[84,'viu']) | (rules['antecedents'] <= rec_a_rules.loc[84,'viu'])) & ((rules["consequents"]-rec_a_rules.loc[84,'viu'])==rules["consequents"])].groupby(["consequents"]).max()
Out[15]:
antecedents antecedent support consequent support support confidence lift leverage conviction
consequents
(34061) (52672) 0.128443 0.098144 0.058568 0.822642 8.381951 0.045962 5.084931
(56370) (47207) 0.091476 0.092925 0.039867 0.439542 4.730053 0.031366 1.618454
(10376) (17971) 0.099739 0.088431 0.041606 0.468954 5.303027 0.033761 1.716554
(58104, 34061) (26656, 45119) 0.059727 0.052769 0.030879 0.516990 9.797250 0.027727 1.961102
(36449) (45119) 0.128443 0.233836 0.051464 0.521845 2.231670 0.021738 1.602333
(48019) (49294) 0.233836 0.404465 0.115831 0.551402 1.363287 0.021252 1.327547
(9180) (6632) 0.100464 0.095245 0.042911 0.564171 5.923367 0.033411 2.075941
(54903) (52672, 49294) 0.110757 0.093215 0.055523 0.589385 6.322832 0.045199 2.208360
(55633) (26656) 0.100464 0.089591 0.042476 0.601140 6.709807 0.033540 2.282525
(8039) (19726) 0.100464 0.091766 0.043056 0.626471 6.826847 0.034662 2.431493
(36096) (56915, 56916) 0.100464 0.090606 0.043781 0.627273 6.923084 0.034678 2.439838
(64081) (49294) 0.125399 0.091186 0.050159 0.661342 7.252681 0.038725 2.683574
(35707) (56916) 0.112206 0.091041 0.047695 0.699346 7.681674 0.038615 3.023277
(14813) (11322) 0.100464 0.093650 0.046535 0.702341 7.499612 0.037937 3.044928
(31831) (11322) 0.128443 0.094230 0.055088 0.702875 7.459130 0.042985 3.048451
(20644, 20645) (56915, 56916) 0.110757 0.069730 0.058278 0.704319 10.100607 0.050555 3.146193
(25422) (45119) 0.128443 0.090316 0.052624 0.706790 7.825744 0.041023 3.102501
(28315) (56915, 56916) 0.100464 0.088431 0.047115 0.725753 8.206952 0.038295 3.323890
(42237) (52672) 0.128443 0.146419 0.059872 0.730263 4.987480 0.041849 3.164494
(58104) (24251) 0.128443 0.108147 0.056393 0.774194 7.158696 0.043722 3.949633
(20645) (11322) 0.128443 0.105393 0.070890 0.797342 7.565428 0.059217 4.414373
(20644) (52672) 0.128443 0.108727 0.074369 0.820598 7.547313 0.062327 4.968021
(58804) (17971) 0.100464 0.088286 0.042041 0.434641 4.923071 0.033236 1.612626
(44439) (17971) 0.091476 0.090026 0.036967 0.413399 4.591987 0.028732 1.551265


Finally, in the following table we gathered information about the set of movies recommended to each user based on the 10 rules with the highest confidence that apply to him, the set of movies he effectively watched, and the respective recall and precision measures.

In [22]:
def conf(userid,rules,rec_a_rules,k):
  re=pd.DataFrame(data=set(rules[( ( rules['antecedents'] >= rec_a_rules.loc[userid,'viu'] ) | ( rules['antecedents'] <= rec_a_rules.loc[userid,'viu'] ) ) & ( (rules["consequents"]-rec_a_rules.loc[userid,'viu'])==rules["consequents"] )].groupby(["consequents"]).max().index.to_list()[0:k]))
  n=len(re.columns)
  if (n!=0):
    data=re[0].to_list()
    for i in range(1,n):
      data+=re[i].to_list()
    re2=pd.DataFrame(data=data)
    re2=re2.drop_duplicates(keep='first')
    re2=re2.dropna()
    return re2[0].to_list()
  else:
    return []

def rec_rules(rules,rec_a_rules,N):
  users=rec_a_rules.index.to_numpy()
  vfunc = np.vectorize(conf, excluded=["rules","rec_a_rules","k"], otypes=[list])
  return pd.DataFrame(data={"userid":users, "recomendas":vfunc(users,rules=rules,rec_a_rules=rec_a_rules,k=N)})

def r_lista_after(userid,rp):
  return rp[np.isnan(rp[userid])!=True].index.tolist()

def table_after(date1,date2,date3):
  rp=revert_pivot_after(date1,date2,date3)
  users=rp.columns.to_numpy()
  vfunc = np.vectorize(r_lista_after, excluded=["rp"], otypes=[list])
  return pd.DataFrame(data={"userid":users, "seen_foll_month":vfunc(users,rp=rp)})



def revert_pivot_after2(date1,date2,date3):
  d2=pd.to_datetime(date2, format='%d/%m/%Y')
  d3=pd.to_datetime(date3, format='%d/%m/%Y')
  tx_user = ratings.loc[(ratings['date'] <= d3) & (ratings['date'] >= d2)]
  tx_user=tx_user[tx_user.movieid.isin(revert_pivot_v(date2,date3).index)]
  index=pd.pivot_table(tx_user, index='movieid', columns='userid', values='rating')
  return index

def table_after_2(date1,date2,date3):
  rp=revert_pivot_after2(date1,date2,date3)
  users=rp.columns.to_numpy()
  vfunc = np.vectorize(r_lista_after, excluded=["rp"], otypes=[set])
  return pd.DataFrame(data={"userid":users, "seen_foll_month":vfunc(users,rp=rp)})


def is_true2(userid,r):
  u=pd.DataFrame(data={"movies": r.loc[userid,"seen_foll_month"]+r.loc[userid,"recomendas"]})
  return len(u[u.duplicated()])

def label2(r):
  users=r.dropna(subset=['seen_foll_month'])["userid"].to_numpy()

  vfunc = np.vectorize(is_true2, excluded=["r"], otypes=[int])
  r_index=r.set_index('userid')
  num=vfunc(users,r=r_index)
  label=pd.DataFrame(data={"userid":users, "n_success_recommendations":num})
  r=pd.merge(r,label,on='userid',how='left')
  r["n_success_recommendations"].fillna(0,inplace=True)
  return r




def K_recomendacoes_associa(date1,date2,date3,k):
  trans=table_viram(date1,date2)
  dataset=trans["seen"].tolist()
  te = TransactionEncoder()
  te_ary = te.fit(dataset).transform(dataset)
  df = pd.DataFrame(te_ary, columns=te.columns_)
  frequent_itemsets = apriori(df, min_support=0.025, use_colnames=True)
  rules=association_rules(frequent_itemsets, metric="confidence", min_threshold=0.5)
  rules=rules.sort_values(by=["confidence","support","lift"],ascending=[False,False,False])
  
  rec_a_rules = viram_same(date1,date2)
  rec_a_rules=rec_a_rules.set_index('userid')
  recs= rec_rules(rules,rec_a_rules,k)
  verif=table_after_2(date1,date2,date3)
  recs2=pd.merge(recs,verif,on='userid',how='left')
                           
  recs2=label2(recs2)
  recs2["seen_foll_month"].fillna(0,inplace=True)
  recs2["recomlen"]=recs2["recomendas"].apply(len)
  recs2["P"]=recs2["n_success_recommendations"]/recs2["recomlen"]
  recs2["P"].fillna(0,inplace=True)
  recs2["n_watched"]=recs2["seen_foll_month"].apply(lambda x: len(x) if x!=0 else 0)
  recs2["R"]=recs2["n_success_recommendations"]/recs2["n_watched"]
  recs2["R"].fillna(0,inplace=True)
  return recs2

recomendas= K_recomendacoes_associa('01/05/2007','01/06/2007','01/07/2007',10)
recomendas.head()
Out[22]:
userid recomendas seen_foll_month n_success_recommendations recomlen P n_watched R
0 84 [28315.0, 58104.0, 44439.0, 34061.0, 20644.0, ... 0 0.0 8 0.000000 0 0.000000
1 288 [49294.0, 35707.0, 24251.0, 20800.0, 17971.0, ... [8545, 17971, 31831, 36866, 49294, 52794, 55421] 2.0 11 0.181818 7 0.285714
2 729 [24251, 16752, 56915, 34815, 49294, 36449, 569... 0 0.0 10 0.000000 0 0.000000
3 767 [56915.0, 58104.0, 14813.0, 11322.0, 45119.0, ... 0 0.0 11 0.000000 0 0.000000
4 792 [35707, 64081, 25422, 56915, 52672, 55633, 283... 0 0.0 10 0.000000 0 0.000000


We then studied the performance of the algorithm, for the usual measures, when giving 3, 5, 10 or 20 rules with a minimum support of 0.025 and minimum confidence of 0.5. In general the performance is not as good as expected, or even as good as that of the very simple popularity algorithm. One thing we can notice, however, is that the precision does not decrease as much with a higher number of recommendations (and rules) as in the popularity model; in fact, the F1 score increases with the number of rules. So, with a better machine to compute the rules on our large dataset, we could generate more rules with lower support that would probably fit more users with more specific itemsets.
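The per-user measures behind these tables can be checked by hand. A toy computation with invented movie ids:

```python
# One user: 5 recommended movies, 4 actually watched in the following month
recommended = [10, 20, 30, 40, 50]
watched = {20, 50, 60, 70}

hits = len(set(recommended) & watched)  # successful recommendations: 2
precision = hits / len(recommended)     # 2/5 = 0.4
recall = hits / len(watched)            # 2/4 = 0.5
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, round(f1, 3))  # 0.4 0.5 0.444
```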

In [23]:
precision=[]
percentage=[]
n=[]
recall=[]
f_score=[]
for i in [3,5,10,20]:
  n.append(i)
  r=K_recomendacoes_associa('01/05/2007','01/06/2007','01/07/2007',i)
  precision.append(str(np.mean(r.query("seen_foll_month != 0")["P"])*100)[0:4]+"%")
  recall.append(str(np.mean(r.query("seen_foll_month != 0")["R"])*100)[0:4]+"%")
  p=np.mean(r.query("seen_foll_month != 0")["P"])*100
  R=np.mean(r.query("seen_foll_month != 0")["R"])*100
  f_score.append(str(2*p*R/(R+p))[0:4]+"%")
  percentage.append(str(r.query("n_success_recommendations > 0")["n_success_recommendations"].count()/r.query("seen_foll_month != 0")["n_success_recommendations"].count()*100)[0:4]+"%")
measures = pd.DataFrame(data={"Precision":precision,"Recall":recall,"F1":f_score,"% of users that watched 1 or more recommendations":percentage, "n of rules":n})
measures = measures.set_index('n of rules')

measures
Out[23]:
Precision Recall F1 % of users that watched 1 or more recommendations
n of rules
3 10.5% 6.28% 7.87% 23.2%
5 9.93% 8.88% 9.37% 28.5%
10 9.74% 13.9% 11.4% 37.1%
20 9.56% 18.6% 12.6% 42.4%

Collaborative Filtering


  Collaborative Filtering recommendations are based on the similarity of items and on the knowledge that a user will likely consume the same items as other users with similar taste did in the past. To make recommendations we do not need any information about the content of the items or the profiles of the users, because similarity is based only on the differences and resemblances between the sets of users that liked, watched or ignored those items.
  Our recommendations can be either user or item based, which basically means that our measures will be computed either row or column-wise, but in general the pipeline is the following:
  • according to a chosen measure, compute the similarity function between items (similar items are rated in a similar way by the same users) or users (similar users give similar ratings to the same items);
  • create our model as the item-item or user-user similarity matrix;
  • find the k-nearest users or k-nearest items between the pairs in the matrix;
  • select N item-based recommendations, giving the user the items most similar to the ones he has seen, or N user-based recommendations by selecting the items preferred by his nearest users.

  These similarity measures can be any of the ones we already know, such as the cosine similarity, the Jaccard index, the Pearson correlation coefficient or the mean squared difference similarity.
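As a concrete illustration, here is a minimal numpy sketch (with invented ratings) of three of these measures between two users' rating vectors over the same five movies; the msd similarity follows the 1/(msd + 1) convention used by the Surprise package.

```python
import numpy as np

# Hypothetical ratings of two users for the same five movies
u = np.array([5.0, 3.0, 4.0, 4.0, 1.0])
v = np.array([4.0, 2.0, 5.0, 3.0, 2.0])

# cosine similarity: angle between the raw rating vectors
cosine = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Pearson correlation: cosine of the mean-centred vectors
pearson = np.corrcoef(u, v)[0, 1]

# mean squared difference similarity, 1 / (msd + 1)
msd_sim = 1 / (1 + np.mean((u - v) ** 2))

print(round(cosine, 3), round(pearson, 3), round(msd_sim, 3))
```

Note how cosine similarity stays high for any pair of positive rating vectors, while Pearson correlation discounts each user's rating bias; this is one reason the mean- and z-score-normalised KNN variants below tend to outperform KNNBasic.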

Implementation



  For this model we decided to use a Python scikit specialized in building and analyzing recommender systems, called Surprise. This way it was possible to implement several ready-to-use prediction models with different similarity measures, both user and item based.

  Before building any model, we first analyzed the distributions of the number of ratings per movie and per user by plotting their histograms. This helps us understand which thresholds we should use to determine whether a movie or user is relevant, that is, the number of interactions they should have in order not to be discarded from the final dataset we are going to work with. This step is very important because it reduces the dimensionality of our data and therefore speeds up our processing and running times. The resulting histograms are available in the interactive images of our Colab notebook version.

  After choosing to discard users and movies with less than a thousand ratings in total, we looked at the time period of the data that our systems will have access to. We decided to use the ratings from the beginning of 2006 to the end of the first semester of 2007. Following this, we had to use a Reader to load our pre-processed dataframe in the correct format for the package. We then divided the data into training and test sets as usual, using a 75%/25% ratio, and created a list of models from the Surprise package with different measures and ways of calculating the predictions.

  For each of the models (KNNBasic, KNNWithMeans and KNNWithZScore) we tried the 6 possible combinations of similarity measures (cosine, msd, pearson) with item- and user-based recommendations. After this we computed the predictions and compared the results based on the Root Mean Squared Error to choose the model with the best performance. As we can see in the following table, the first 2 models have the lowest errors and are very close together; we decided to work with the user-based KNNWithZScore model that uses Pearson correlation as the similarity measure.

In [34]:
ratings=pd.read_csv("ratings.csv",sep='\t', lineterminator='\r',encoding = "UTF-16 LE")
ratings['date']= pd.to_datetime(ratings['date']) 

ratings.drop([8196077],inplace=True)

ratings2 = ratings.copy()
max_date = datetime(2007, 6, 1) 
ratings2 = ratings2.loc[ratings['date'] <= max_date]

min_movie_ratings = 1000
filter_movies = ratings2['movieid'].value_counts() > min_movie_ratings
filter_movies = filter_movies[filter_movies].index.tolist()

min_user_ratings = 1000
filter_users = ratings2['userid'].value_counts() > min_user_ratings
filter_users = filter_users[filter_users].index.tolist()

ratings_new = ratings2[(ratings2['movieid'].isin(filter_movies)) & (ratings2['userid'].isin(filter_users))]
print('The original data frame shape:\t{}'.format(ratings2.shape[0]))
print('The new data frame shape:\t{}'.format(ratings_new.shape[0]))

from surprise import Dataset
from surprise import Reader
from surprise import KNNBasic,KNNWithMeans,KNNWithZScore
from surprise.model_selection import train_test_split
from surprise import accuracy

min_date = datetime(2006, 1, 1) 


def filter_by_date(min_date,max_date,table):
  return table.loc[(table['date'] >= min_date) & (table['date'] <= max_date)]

ratings_new = filter_by_date(min_date,max_date,ratings_new)
ratings_new.movieid = ratings_new.movieid.astype(int)
n_users = ratings_new.userid.unique().shape[0]
n_items = ratings_new.movieid.unique().shape[0]
n_rows = ratings_new.shape[0]
print("Number of Rows: {}, Number of users: {} , Number of movies: {}".format(n_rows, n_users, n_items))

reader = Reader(rating_scale=(0, 5))
data = Dataset.load_from_df(ratings_new[['userid', 'movieid', 'rating']], reader)

# sample random trainset and testset
# test set is made of 25% of the ratings.
trainset, testset = train_test_split(data, test_size=.25)

algo_list = []

# Algorithm parameters to use 
sim_options1 = {'name': 'cosine',
               'user_based': False  #Item-based cosine similarity
               }
sim_options2 = {'name': 'msd',
               'user_based': False  #Item-based msd similarity
               }
sim_options3 = {'name': 'pearson',
               'user_based': False  #Item-based pearson correlation coeficient similarity
               }
sim_options4 = {'name': 'cosine',
               'user_based': True  #User-based cosine similarity
               }
sim_options5 = {'name': 'msd',
               'user_based': True  #User-based msd similarity
               }
sim_options6 = {'name': 'pearson',
               'user_based': True  #User-based pearson correlation coeficient similarity
               }

algo_list.append(KNNBasic(sim_options=sim_options1))
algo_list.append(KNNBasic(sim_options=sim_options2))
algo_list.append(KNNBasic(sim_options=sim_options3))
algo_list.append(KNNBasic(sim_options=sim_options4))
algo_list.append(KNNBasic(sim_options=sim_options5))
algo_list.append(KNNBasic(sim_options=sim_options6))
algo_list.append(KNNWithMeans(sim_options=sim_options1))
algo_list.append(KNNWithMeans(sim_options=sim_options2))
algo_list.append(KNNWithMeans(sim_options=sim_options3))
algo_list.append(KNNWithMeans(sim_options=sim_options4))
algo_list.append(KNNWithMeans(sim_options=sim_options5))
algo_list.append(KNNWithMeans(sim_options=sim_options6))
algo_list.append(KNNWithZScore(sim_options=sim_options1))
algo_list.append(KNNWithZScore(sim_options=sim_options2))
algo_list.append(KNNWithZScore(sim_options=sim_options3))
algo_list.append(KNNWithZScore(sim_options=sim_options4))
algo_list.append(KNNWithZScore(sim_options=sim_options5))
algo_list.append(KNNWithZScore(sim_options=sim_options6))

input_rows = []
# Iterate over all algorithms
for algorithm in algo_list:
    # Train the algorithm on the trainset, and predict ratings for the testset
    algorithm.fit(trainset)
    predictions = algorithm.test(testset)

    # Then compute RMSE
    result = accuracy.rmse(predictions,verbose=False)
    
    #Get algotithm parameters
    similarity = algorithm.sim_options['name']
    base = algorithm.sim_options['user_based']
    if base:
      base = 'user_based'
    else:
      base = 'item_based'

    #Add algorithm full name and result to input_rows list
    input_rows.append((str(algorithm).split(' ')[0].split('.')[-1]+"/"+similarity+"/"+base,result))

rows_list = []
for row in input_rows:
        dict1 = {}
        # get input row in dictionary format
        # key = col_name
        dict1.update(Algorithm = row[0], RMSE = row[1]) 
        rows_list.append(dict1)

performance_compare = pd.DataFrame(rows_list,columns=['Algorithm','RMSE'])
performance_compare.set_index('Algorithm').sort_values('RMSE')
The original data frame shape:	2763313
The new data frame shape:	184713
Number of Rows: 184706, Number of users: 389 , Number of movies: 656
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Out[34]:
RMSE
Algorithm
KNNWithMeans/msd/item_based 0.836501
KNNWithZScore/pearson/user_based 0.836544
KNNWithZScore/msd/item_based 0.837411
KNNWithZScore/pearson/item_based 0.838863
KNNWithMeans/pearson/item_based 0.839821
KNNWithMeans/cosine/item_based 0.840930
KNNBasic/msd/item_based 0.841757
KNNWithZScore/cosine/item_based 0.841821
KNNWithMeans/pearson/user_based 0.842749
KNNBasic/cosine/item_based 0.852973
KNNBasic/pearson/item_based 0.853420
KNNWithZScore/cosine/user_based 0.858875
KNNWithMeans/msd/user_based 0.861582
KNNWithMeans/cosine/user_based 0.862172
KNNWithZScore/msd/user_based 0.864663
KNNBasic/msd/user_based 0.900936
KNNBasic/pearson/user_based 1.018147
KNNBasic/cosine/user_based 1.113517
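
The 18 rows above correspond to the full grid of configurations we evaluated: 3 KNN variants, 3 similarity measures, and user- vs item-based neighborhoods. A minimal sketch of how that grid can be enumerated (the surprise instantiation shown in the comment is an assumption about how the models were built, not a verbatim copy of our loop):

```python
# The 18 configurations behind the RMSE table: 3 KNN variants x 3 similarity
# measures x {user-based, item-based}.
from itertools import product

algorithms = ["KNNBasic", "KNNWithMeans", "KNNWithZScore"]
similarities = ["cosine", "msd", "pearson"]
bases = [("user_based", True), ("item_based", False)]

configs = []
for algo_name, sim, (base_label, user_based) in product(algorithms, similarities, bases):
    # with surprise this would be, e.g.:
    #   algo = KNNWithZScore(sim_options={"name": sim, "user_based": user_based})
    configs.append((f"{algo_name}/{sim}/{base_label}",
                    {"name": sim, "user_based": user_based}))

print(len(configs))  # 18, one per row of the table above
```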


With the chosen model we then made rating predictions to later build our recommender system. As in all other models, we take advantage of numpy array vectorization to speed up the calculations on such high-dimensional data. Below we show the predicted rating for a given user-movie pair side by side with the real rating. The algorithm did not have access to the real rating, since it was only given later in time, and yet the estimate is close: the model predicted a score of 3.56 for user 288 and movie 49294, while the real rating was 4. Since users can only give ratings on a discrete scale, this prediction is very accurate. Note, however, that the prediction details report `was_impossible: True`, meaning this user-item pair was unknown to the trained model, so the estimate is surprise's fallback default rather than a neighborhood-based prediction.

In [35]:
def predict_rank(movieid,userid,model):
  preds = model.predict(userid,movieid) # estimate the rating the user would give
  return preds.est # return the estimated rating
def r_lista_3(userid,rp,k,model):
  vfunc2 = np.vectorize(predict_rank, excluded=["userid","movielist","ranklist","model"], otypes=[list])
  v2 = rp[np.isnan(rp[userid])].index.to_numpy()
  # for each of the popular movies the user has not seen, build a table with the movie and the estimated rating for that user-movie pair
  table = pd.DataFrame(data={"movieid": v2, "rank": vfunc2(v2, userid=userid,model=model)})
  table = table.sort_values(by="rank", ascending=False) # sort the table by estimated rating
  return table["movieid"].to_list()[0:k] # return the k movies with the highest estimated rating
def table_3(date1,date2,n,model):
  rp=revert_pivot(date1,date2) 
  users=rp.columns.to_numpy()
  vfunc = np.vectorize(r_lista_3, excluded=["rp","k","model"], otypes=[list])
  # for each user, obtain the list of recommended movies ordered by estimated rating
  return pd.DataFrame(data={"userid":users, "recomRank":vfunc(users,rp=rp,k=n,model=model)})

#recomendas=table_3('01/05/2007','01/06/2007',10)
#verif=table_after('01/05/2007','01/06/2007','01/07/2007')
#revert_pivot_after('01/05/2007','01/06/2007','01/07/2007').head()
#recomendas=pd.merge(recomendas,verif,on='userid',how='left')
#recomendas.head()

def is_true_3(userid,r):
  u=pd.DataFrame(data={"movies": r.loc[userid,"seen_foll_month"]+r.loc[userid,"recomRank"]})
  return len(u[u.duplicated()])
def label_3(r):
  users=r.dropna(subset=['seen_foll_month'])["userid"].to_numpy()

  vfunc = np.vectorize(is_true_3, excluded=["r"], otypes=[int])
  r_index=r.set_index('userid')
  num=vfunc(users,r=r_index)
  label=pd.DataFrame(data={"userid":users, "n_success_recommendations":num})
  r=pd.merge(r,label,on='userid',how='left')
  r["n_success_recommendations"].fillna(0,inplace=True)
  return r
#recomendas=label_3(recomendas)
#recomendas.head()

model = algo_list[-1]
model.fit(trainset)
print(model.predict(288,49294.0))
ratings.query('userid == 288 & movieid == 49294.0')
Computing the pearson similarity matrix...
Done computing similarity matrix.
user: 288        item: 49294.0    r_ui = None   est = 3.56   {'was_impossible': True, 'reason': 'User and/or item is unkown.'}
Out[35]:
userid movieid rating date
40049 288 49294.0 4.0 2007-06-30


Now we can make recommendations based on the predicted ratings, following the same approach as before. The following table shows a sample of the resulting model's output, with 10 recommendations given for the month of June.

In [53]:
def K_recomendacoes_ranking(date1,date2,date3,N,algo):
  ratings2 = ratings.copy()
  max_date = datetime(2007, 6, 1)  # keep only ratings up to June 2007 before filtering
  ratings2 = ratings2.loc[ratings2['date'] <= max_date]

  min_movie_ratings = 1000
  filter_movies = ratings2['movieid'].value_counts() > min_movie_ratings
  filter_movies = filter_movies[filter_movies].index.tolist()

  min_user_ratings = 1000
  filter_users = ratings2['userid'].value_counts() > min_user_ratings
  filter_users = filter_users[filter_users].index.tolist()

  ratings_new = ratings2[(ratings2['movieid'].isin(filter_movies)) & (ratings2['userid'].isin(filter_users))]

  min_date = datetime(int(date1[6:10]), int(date1[3:5]), int(date1[0:2])) 
  max_date = datetime(int(date2[6:10]), int(date2[3:5]), int(date2[0:2])) 

  def filter_by_date(min_date,max_date,table):
    return table.loc[(table['date'] >= min_date) & (table['date'] <= max_date)]

  ratings_new = filter_by_date(min_date,max_date,ratings_new)
  ratings_new.movieid = ratings_new.movieid.astype(int)
  n_users = ratings_new.userid.unique().shape[0]
  n_items = ratings_new.movieid.unique().shape[0]
  n_rows = ratings_new.shape[0]

  reader = Reader(rating_scale=(0, 5))
  data = Dataset.load_from_df(ratings_new[['userid', 'movieid', 'rating']], reader)
  trainset, testset = train_test_split(data, test_size=.25)
  model = algo
  model.fit(trainset)

  recomendas=table_3(date1,date2,N,model)
  verif=table_after(date1,date2,date3)
  recomendas=pd.merge(recomendas,verif,on='userid',how='left')
  recomendas=label_3(recomendas)
  recomendas["seen_foll_month"].fillna(0,inplace=True)
  recomendas["recomlen"]=recomendas["recomRank"].apply(len)
  recomendas["nºvistas"]=recomendas["seen_foll_month"].apply(lambda x: len(x) if x!=0 else 0)
  recomendas["recall"]=recomendas["n_success_recommendations"]/recomendas["nºvistas"]
  recomendas["recall"].fillna(0,inplace=True)
  return recomendas

K_recomendacoes_ranking('01/05/2007','01/06/2007','01/07/2007',10,algo_list[-1]).head()
Computing the pearson similarity matrix...
Done computing similarity matrix.
Out[53]:
userid recomRank seen_foll_month n_success_recommendations recomlen nºvistas recall
0 84 [48019.0, 36449.0, 10376.0, 28315.0, 55633.0, ... 0 0.0 10 0 0.00
1 288 [48019.0, 36096.0, 15587.0, 56370.0, 11322.0, ... [49294.0, 31831.0, 14813.0, 17971.0] 1.0 10 4 0.25
2 729 [39384.0, 9180.0, 14813.0, 54903.0, 15587.0, 5... 0 0.0 10 0 0.00
3 767 [39384.0, 36449.0, 35994.0, 9180.0, 31831.0, 1... 0 0.0 10 0 0.00
4 792 [48019.0, 9180.0, 14813.0, 54903.0, 15587.0, 5... 0 0.0 10 0 0.00
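
A side note on the date handling in the function above: the manual string slicing (`date1[6:10]`, `date1[3:5]`, `date1[0:2]`) can be replaced by `datetime.strptime`, which is equivalent for well-formed `dd/mm/yyyy` strings and fails loudly on malformed input (a small sketch, not the code we used):

```python
from datetime import datetime

def parse_ddmmyyyy(s):
    # equivalent to datetime(int(s[6:10]), int(s[3:5]), int(s[0:2]))
    return datetime.strptime(s, "%d/%m/%Y")

print(parse_ddmmyyyy("01/05/2007"))  # 2007-05-01 00:00:00
```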


Finally, as in all other approaches, we compute the precision, recall and f1 score of the model when recommending 3, 5, 10 or 20 movies. As the table shows, the increase in recall and decrease in precision with the number of recommendations result in an almost constant f1-score. Knowing this, we decided to choose the model with 10 recommendations.

In [56]:
precision=[]
percentage=[]
n=[]
recall=[]
f_score=[]
for i in [3,5,10,20]:
  n.append(i)
  r=K_recomendacoes_ranking('01/05/2007','01/06/2007','01/07/2007',i,algo_list[-1])
  precision.append(str(np.mean(r.query("seen_foll_month != 0")["n_success_recommendations"]/i)*100)[0:4]+"%")
  recall.append(str(np.mean(r.query("seen_foll_month != 0")["recall"])*100)[0:4]+"%")
  p=np.mean(r.query("seen_foll_month != 0")["n_success_recommendations"]/i)*100
  R=np.mean(r.query("seen_foll_month != 0")["recall"])*100
  f_score.append(str(2*p*R/(R+p))[0:4]+"%")
  percentage.append(str(r.query("n_success_recommendations > 0")["n_success_recommendations"].count()/r.query("seen_foll_month != 0")["n_success_recommendations"].count()*100)[0:4]+"%")
measures = pd.DataFrame(data={"Precisão":precision,"Recall":recall,"F1":f_score,"Percentagem de users que viram 1 das recomendações":percentage, "Nº recomendações":n})
measures = measures.set_index('Nº recomendações')
measures
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Out[56]:
Precisão Recall F1 Percentagem de users que viram 1 das recomendações
Nº recomendações
3 18.2% 21.1% 19.5% 43.9%
5 14.9% 25.9% 18.9% 51.3%
10 12.5% 36.3% 18.6% 62.6%
20 11.2% 58.7% 18.8% 78.6%
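
The near-constant f1-score can be verified directly from the precision/recall pairs in the table, since f1 is just their harmonic mean (a quick check using the values above, in %):

```python
# Harmonic mean of precision and recall for each number of recommendations,
# using the values from the table above (in %).
precision = {3: 18.2, 5: 14.9, 10: 12.5, 20: 11.2}
recall = {3: 21.1, 5: 25.9, 10: 36.3, 20: 58.7}

f1 = {n: round(2 * precision[n] * recall[n] / (precision[n] + recall[n]), 1)
      for n in precision}
print(f1)  # {3: 19.5, 5: 18.9, 10: 18.6, 20: 18.8}
```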

Evaluation




  Now that we have our best models for each approach, we have to compare their performances. In order to have a fair comparison we need to test them on the same data and periods of time.
  We decided to compute the 3 measures precision, recall and f1-score for each of the models through a sliding window that traverses the period between January and July of 2007. These dates follow from what we learned in the exploratory data analysis: these models only work with large amounts of data, and this is the period with the largest amount of user activity.
  We had to exclude the association rules algorithm from this analysis because it takes too long to run, but since it had very low precision and recall in comparison to the other 2 methods, we do not think its inclusion was necessary.
  In the Colab version of this report we have interactive plots that illustrate the evolution of these measures over time. Here we could not include the images because of a package dependency issue. Still, in order to understand which model is the best, we computed the means of these measures; the results are listed below.

In [6]:
def measures_calculator(r):
  precision = str(np.mean(r.query("seen_foll_month != 0")["n_success_recommendations"]/r.query("seen_foll_month != 0")["recomlen"])*100)[0:4]
  recall = str(np.mean(r.query("seen_foll_month != 0")["recall"])*100)[0:4]
  p= np.mean(r.query("seen_foll_month != 0")["n_success_recommendations"]/r.query("seen_foll_month != 0")["recomlen"])*100
  R= np.mean(r.query("seen_foll_month != 0")["recall"])*100
  f_score = str(2*p*R/(R+p))[0:4]
  return precision, recall, f_score
In [76]:
P_r , P_p , P_f  = [] , [] , []

CF_r , CF_p , CF_f =  [] , [] , []
data=[]
months=['01','02','03','04','05','06','07','08']
for i in range(len(months)-2):
    date1='01/'+months[i]+'/2007'
    date2='01/'+months[i+1]+'/2007'
    date3='01/'+months[i+2]+'/2007'
    date='01/'+months[i+1]+'/2007'
    data.append(date)
    p1=measures_calculator(K_recomendacoes_populares(date1, date2,date3, 5))
    p2=measures_calculator(K_recomendacoes_ranking(date1, date2,date3, 10, algo_list[-1]))
   
    P_p.append(p1[0]) 
    P_r.append(p1[1]) 
    P_f.append(p1[2])
    CF_p.append(p2[0]) 
    CF_r.append(p2[1]) 
    CF_f.append(p2[2])



for i in range(len(P_r)):
    CF_r[i]=float(CF_r[i])
    CF_p[i]=float(CF_p[i])
    CF_f[i]=float(CF_f[i])
    P_r[i]=float(P_r[i])
    P_p[i]=float(P_p[i])
    P_f[i]=float(P_f[i])
    
CF_p=np.array(CF_p)
CF_r=np.array(CF_r)
CF_f=np.array(CF_f)
P_p=np.array(P_p)
P_r=np.array(P_r)
P_f=np.array(P_f)


print('average precision of collaborative filtering: ', np.mean(CF_p))
print('average precision of popularity: ', np.mean(P_p))
print('average recall of collaborative filtering: ', np.mean(CF_r))
print('average recall of popularity: ', np.mean(P_r))
print('average f1 score of collaborative filtering: ', np.mean(CF_f))
print('average f1 score of popularity: ', np.mean(P_f))
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
Computing the pearson similarity matrix...
Done computing similarity matrix.
average precision of collaborative filtering:  14.533333333333333
average precision of popularity:  17.083333333333332
average recall of collaborative filtering:  36.15
average recall of popularity:  30.933333333333326
average f1 score of collaborative filtering:  20.71666666666667
average f1 score of popularity:  21.933333333333334




  As one can see, precision and f-score are very slightly better for the popularity algorithm. The interesting fact, however, is that collaborative filtering performed much better in terms of recall.

Conclusions



  In this project we were able to build recommendation systems based on 3 different approaches: Popularity, Association Rules and both item and user based Collaborative Filtering.

  Two different popularity models have been implemented: a simpler one that gives a personalized list of the most watched movies, and a recommender that also considers age groups. For both implementations the results were satisfactory for this task and the performance measures were very similar. We achieved the best precision when giving 3 recommendations (around 25%), and the recall increases with the number of recommendations.

  The association rules algorithm was the one that performed the worst: the best precision, again obtained with only 3 rules, was just 10%. The recall also increases with the number of recommendations (and rules), but was much lower than in the popularity model. Our best f-score was only 12.6% for 20 rules, and even with this high number of recommendations more than 50% of the users did not watch any of the recommended movies.

  The last approach we implemented was Collaborative Filtering: we tried several combinations of nearest-neighbor methods and similarity measures, using both item-based and user-based recommendations. After testing the list of models, the one we decided worked best was the user-based KNNWithZScore with Pearson correlation as the similarity measure. For the same time period and data as the previous algorithms, this model achieved a slightly lower, but more stable, precision than the popularity approach, and an f1 score around 20% that does not change much with the number of recommendations because of the correspondingly increasing recall.

  In order to better compare our models we decided to compute recommendations and test their performance throughout a time window in the year of 2007 (where we have a higher volume of user ratings) and calculate their means. The average precision of the collaborative model was 14.5% and of popularity 17.1%. The f1 score was around 21% for both models, but collaborative filtering clearly outperformed popularity in terms of recall, with an average of 36%.

  After this we wanted to create a more sophisticated implementation of a recommender system that could be used in a production environment. We therefore created a form in our dynamic interactive report that generates recommendations according to the time window and approach of your choice, among the ones developed in this project. This application can be accessed at the following link: https://colab.research.google.com/drive/1U9JTlHfXqQMcfnSSMu2v95gZN6Lj_3mN?usp=sharing.

  For future reference we would like to study some methods that we think could potentially boost our models' performances. The first thing we could implement is link analysis and community discovery to help us understand the relationships between users and also movies. After having this information we could build a more complex model that would know which approach between Collaborative Filtering, Popularity and Association Rules it should use to generate recommendations for a particular user.

  As a final note, we would like to mention that in order to fully and correctly test our models in a real-world application, it would probably be more suitable to run an A/B test. In this method we select a percentage of the user population to receive recommendations, and afterwards we evaluate the impact of the recommendations on that group of users in comparison to those who were not given any suggestions.
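
A minimal sketch of such an A/B split, on entirely synthetic data (the user ids, group sizes, and "watched" rates below are hypothetical, chosen only to illustrate the comparison, not taken from Flixster):

```python
import numpy as np

rng = np.random.default_rng(0)
user_ids = np.arange(1000)

# Randomly assign each user to the treatment group (receives recommendations)
# or the control group (no recommendations).
assignment = rng.random(user_ids.size) < 0.5

# Hypothetical outcome: did the user watch a recommended movie in the
# following month? Synthetic rates (30% vs 20%) for illustration only.
watched = rng.random(user_ids.size) < np.where(assignment, 0.30, 0.20)

treatment_rate = watched[assignment].mean()
control_rate = watched[~assignment].mean()
print(f"treatment: {treatment_rate:.2%}, control: {control_rate:.2%}")
```

The evaluation would then compare the two observed rates (with an appropriate significance test) to estimate the causal impact of the recommendations.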

In [33]:
from IPython.display import HTML
display(HTML('<style>.prompt{width: 0px; min-width: 0px; visibility: collapse}</style>'))

HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
Out[33]: